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ABSTRACT 


ay  <v  vrr* 


This  paper  examines , penalized  likelihood  estimation  in  the  context  of 
general  regression  problems,  characterized  as  probability  models  with 
composite  likelihood  functions.  The  emphasis  is  on  the  common  situation  where 
a  parametric  model  is  considered  satisfactory  but  for  inhomogeneity  with 
respect  to  a  few  extra  variables.  A  finite-dimensional  formulation  is 
adopted,  using  a  suitable  set  of  basis  functions.  Appropriate  definitions  of 

r 

deviance,  degrees  of  freedom,  and  residual  are  provided,  and  the  method  of 
cross-validation  for  choice  of  the  tuning  constant  is  discussed.  Quadratic 


approximations  are  derived  for  all  the  required  statistics. 
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SIGNIFICANCE  AND  EXPLANATION 

Statisticians  and  their  clients  have  considerable  experience 
constructing,  fitting  and  interpreting  parametric  models  for  their  data.  But 
in  many  situations  a  completely  parametric  model  is  inappropriate.  This  often 
arises  when  a  parametric  relationship  can  be  justified  on  grounds  of  theory  or 
experience,  but  is  suspected  of  varying  slowly  in  time  or  space.  The 
statistician  is  then  reluctant  either  to  abandon  the  parametric  model,  which 
is  credible  and  useful,  or  to  force  a  particular  dependence  on  time  or  space 
into  the  model,  without  guidance  on  the  form  of  this  dependence.  A  semi- 
parametric  approach  is  needed. 

In  this  paper  the  method  of  maximum  penalized  likelihood  is  advocated  for 
such  problems,  combining  the  ideas  of  fitting  parameters  by  maximizing 
likelihood  whilst  smoothing  with  respect  to  the  extraneous  variables.  This 
method  is  well-known  in  non-parametric  linear  regression,  where  it  includes 
spline  smoothing,  and  also  in  such  problems  as  non-parametric  density 
estimation.  The  present  context  is  a  rather  general  class  of  regression 
models,  allowing  nonlinearity,  statistical  dependence,  and  arbitrary  error 
distributions. 

A  basic  algorithm  for  fitting  the  model  is  derived,  and  asymptotic  theory 
briefly  dicussed.  Various  related  statistics  facilitating  assessment  of 
goodness-of-fit  and  determination  of  the  appropriate  degree  of  smoothing  are 
constructed,  together  with  quadratic  approximations  likely  to  make  numerical 
computational  economical. 


The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC,  and  not  with  the  author  of  this  report. 


PENALIZED  LIKELIHOOD  FOR  GENERAL  SEMI-PARAMETRIC  REGRESSION  MODELS 

Peter  J.  Green 

1.  INTRODUCTION 

It  frequently  arises  that  a  statistician  has  some  faith  in  the  validity 
of  a  certain  parametric  statistical  model  for  his  data,  but  for  some  suspected 
inhomogeneity  with  respect  to  one  or  more  extraneous  variables.  Typically, 
such  variables  might  represent  space  or  time,  the  relationship  between  them 
and  the  response  is  not  of  primary  interest,  and  the  statistician  is  inhibited 
from  extending  his  parametric  model  to  encompass  them  because  of  a  lack  of 
experience,  information  or  theory  about  the  form  of  their  relationship.  A 
simple  example  might  arise  with  binomial  data  from  different  geographical 
locations  where  it  might  be  quite  reasonable  to  model  the  response  probability 
as  a  logistic  regression  on  various  explanatory  factors  or  covariates,  but 
influenced  also  by  environmental  effects,  unknown  in  form  but  believed  to  vary 
smoothly  with  location. 

In  such  situations,  procedures  derived  from  penalized  likelihoods  (Good 
and  Gaskins  (1971),  Silverman  (1984))  may  well  be  appropriate.  The  purpose  of 
this  paper  is  to  examine  properties  of  such  methods  in  the  context  of  the 
rather  general  class  of  regression  models  used  by  Green  (1984),  characterized 
as  probability  models  expressed  as  composite  likelihood  functions.  (This  is 
not  to  claim  that  such  a  view  of  regression  is  universally  appropriate).  The 
methods  discussed  combine  the  ideas  of  fitting  the  parametric  part  of  the 
model  by  maximizing  likelihood  whilst  smoothing  with  respect  to  the  extraneous 
variables . 
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We  consider  a  log-likelihood  function  L(y;t|>)  for  the  data  y  in  term 
of  a  n-vector  of  predictors  \Ji.  It  will  be  helpful  to  allow  the 
dimensionality  of  the  data  vector  to  be  greater  or  less  than  n.  The  form  of 
this  probability  model  will  not  be  seriously  questioned,  but  rather  the  focus 
of  attention  will  be  on  the  dependence  of  on  explanatory  factors  and 
covariates,  symbolised  by  x,  and  extraneous  variables,  denoted  by  t.  we 
suppose  that  i|i  has  a  prescribed  functional  form  in  terms  of  x,t,  a 
p-vector  of  unknown  parameters  £,  and  an  unknown  real-valued  function  y 
defined  on  the  space  in  which  t  is  measured  (typically  R1  or  R2). 

Thus  the  complete  model  is 

L(yj^(x,$,t,Y) )  =  L(\p(g,Y))  ,  (1.1) 

say,  where  the  observed  y,x  and  t  may  be  omitted  from  the  notation.  Here 
only  8  6  if  and  Y,  lying  in  some  prescribed  linear  space  of  functions 
6,  are  unknown  and  our  principal  interest  is  in  g. 

For  an  example,  consider  a  logistic  regression  model  in  which  the 
'intercept'  term  varies  in  time.  Then  if  the  ith  observation  is  of  y^ 
successes  out  of  m^  trials,  with  covariates  {x^}  recorded  at  time  tif 
we  would  write 

n 

L(  +  )  «  l  {ytlog  +  (m.  -  y,  )log(  1  -  \|».  )} 
i-1 

where 

-  {l  +  exp(-  l  xijgj  -  Y(ti))}"1  . 

This  simple  problem  is  typical  of  many  where  maximum  likelihood  leads  to  over¬ 
fitting,  in  the  absence  of  any  restriction  on  the  form  of  the  function  Y*  At 
least  there  will  be  unidentifiability  of  parameters;  possibly  a  'Dirac 
catastrophe'.  In  the  context  of  density  estimation.  Good  and  Gaskins  (1971) 
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proposed  maximizing  instead  the  penalized  likelihood 

P  =  L(4»(0#Y))  “  XJ(Y)  (1.2) 

where  J  is  a  roughness  penalty,  increasing  as  the  function  y  becomes  less 
smooth,  and  X  is  a  non-negative  tuning  constant  or  hyperparameter  which  may 
be  adjusted  to  control  the  smoothness  of  the  fitted  y.  There  is  of  course  a 
Bayesian  interpretation;  see  Section  4. 

If  the  likelihood  L  is  that  of  independent  observations  yi#  Normally 
distributed  with  means  ^  linear  in  3  and  y(ti),  then  this  penalized 
likelihood  approach  is  equivalent  to  a  semi-parametric  linear  regression  as 
proposed  by  Green,  Jennison  and  Seheult  (1983,  1985)  in  the  context  of 
agricultural  field  experiments,  Engle,  Granger,  Rice  and  Weiss  (1983),  in  an 
economic  problem  and  Wahba  ( 1984) ,  who  with  co-workers  has  developed  a 
considerable  body  of  theory  for  such  'partial-spline'  methods.  Use  of 
penalized  likelihood  in  simple  generalized  linear  models  is  discussed  by 
O'Sullivan  in  his  thesis  (1983),  by  O'Sullivan,  Yandell  and  Raynor  (1984)  and 
by  Silverman  (1985).  Leonard  (1982)  considers  such  methods  for  a  variety  of 
curve  estimation  problems  from  a  full-blooded  empirical  Bayesian 
perspective.  In  all  of  these  papers  the  parametric  part  of  the  model  (x,3) 
is  not  present,  but  Wahba  (1985)  remarks  that  the  ideas  of  partial  splines  may 
be  combined  with  penalized  likelihood  for  generalized  linear  models.  In  none 
of  these  papers  in  the  general  regression  model  (1.1)  addressed,  and  typically 
identification  of  y  is  not  regarded  as  of  subsidiary  importance  to  the 
efficient  estimation  of  0. 


2.  THE  ESTIMATION  PROCEDURE 

We  begin  by  apparently  compromising  the  generality  of  prescription  of  our 
problem.  As  it  stands,  (1.1)  allows  the  predictor  \(i  to  depend  on  infinitely 
many  values  of  the  function  y.  In  practice,  therefore,  discretization  will 
be  necessary  at  some  stage.  Following  a  suggestion  of  Leonard  (1982),  in 
maximizing  (1.2),  we  will  restrict  y  to  lie  in  a  finite-dimensional  subspace 
of  G,  namely  F  =  span  {<}>..,  j  =  1,2,...,q},  for  a  prescribed  set  of  q 
basis  functions.  We  write 


'a 

8(0,5)  =  <M0,  1  Mi>  ' 

j=i  3  3 


(2.1) 


and  will  further  restrict  attention  to  roughness  penalties  of  the  form 
,  ?  T 

Jl  Z  J  ~  5  K5  for  some  fixed  q  *  q  non-negative  definite  matrix  K. 

j  =  1  3  3 

It  may  seem  that  we  are  abandoning  our  intended  semi -parametric 
framework,  but  it  should  be  stressed  that  q,  while  it  may  be  somewhat  less 
than  n,  will  still  be  ’large1,  and  parametric  estimation  of  5  will  not  be 
appropriate.  Further,  the  intention  is  that  F  and  G  should  in  practical 
terms  be  indistinguishable.  This  will  entail  appropriate  choice  of  { 4> ^ }  as, 
for  example,  a  large  class  of  orthogonal  polynomials  or  trigonometric 
functions  in  t.  This  choice  will  depend  on  the  observed  values  of  t,  and 
will  also  determine  K.  The  precise  quadratic  form  of  the  roughness  penalty 
is  hardly  necessary  in  what  follows,  but  it  simplifies  the  algebra  and  is  not 
likely  to  make  any  practical  difference. 

There  may  in  fact  be  no  restriction  at  all.  In  non-parametric 
regression,  G  is  typically  a  reproducing  kernel  Hilbert  space,  on  which  J 
is  a  squared  semi -norm.  Suppose  i|>  depends  only  on  (y(t^),  i  =  1,2,...,q}. 


Then  since 


min{j(Y)  :  Yft^)  =  5^,  i  =  1*2,. ..,q}  *  5  K5 
for  a  certain  K,  and  we  may  choose  F  to  consist  of  a  basis  of  spline 
functions  with  $  (t^)  “  6^,  the  original  and  the  restricted  problem  have 
the  same  solution  so  far  as  values  of  8  and  of  (Y(t^)}  are  concerned. 

We  therefore  maximize 


P  -  L<6<3,5)>  -  X£TK5  (2.2) 

over  3  e  pP,  5  e  1^,  where  0  is  a  prescribed  Revalued  function.  This 
revised  formulation  has  the  further  advantage  of  allowing  certain  new  problems 
into  our  framework,  that  could  not  otherwise  be  naturally  described  with  a 
vector  of  predictors  i|»  of  finite  length:  see  example  (d)  in  the  next 
section. 

We  will  only  be  concerned  with  problems  where  likelihood  methods  are 
appropriate:  we  suppose  sufficient  regularity  that  L  is  approximately 
quadratic  near  the  'true  values'  80,50.  A  modification  to  an  iteratively 
reweighted  least  squares  algorithm  derived  from  the  Newton-Raphson  method, 
with  Fisher  scoring,  (see  Green,  1984)  should  therefore  be  appropriate. 

Write 


3L 

U  =  10'  A 


4- A?]. 

300 


_30 

33' 


E 


30 

35 


The  scores  u  form  an  n-vector,  and  the  matrices  A,  D,  and  E  are 
n  x  n,  n  x  p  and  n  x  q.  All  of  these  quantities  in  general  depend  on  8 
and  5,  (in  the  case  of  u  and  A  only  through  0),  but  these  dependencies 
will  be  suppressed  from  the  notation.  The  expectation  is  taken  at  the  current 
values  of  3  and  5*  Differentiating  (2.2)  gives  the  modified  likelihood 
equations 

DTu  =  0  (2.3) 


ETu  -  XK5 


Their  solution  gives  our  required  maximum  penalized  likelihood  estimates 


(MPLE's)  0,  K •  Typically  these  equations  are  nonlinear  and  require  iterative 
solution.  The  Newton-Raphson  method  with  expected  second  derivatives  involves 
successively  replacing  trial  estimates  (0,£)*  at  which  u,A,D  and  E  are 
evaluated,  by  (0*,5*)  where 


T  T 

fD  AD  D  AE 


fU  AU  U  AE  ug*  -  0-|  ^  (D  U  •> 

l  m  m  Jl rt  .  c  J  v  m  / 

E  AD  EAE+XK  S  E  U  -  AK£ 


or,  equivalently. 


where 

G  =  H  ♦  (J  H  -  (°^)A[D  :  E]  , 

E 


(2.4) 


and 

Y  =  A_1U  +  D0  +  E?  .  (2.5) 

These  equations  have  the  form  of  a  combination  of  weighted  normal  equations, 
for  0*,  and  generalized  ridge  regression  equations,  for  £*.  The  two 
ingredients  are  found  separately  in  Green  (1984)  and  O'Sullivan,  Yandell  and 
Raynor  (1984).  See  also  Silverman  (1985,  Section  8.1). 

We  can  now  move  towards  stating  conditions  on  the  model  (1.1),  (2.1)  for 

T 

this  approach  to  be  applicable.  First  represent  K  as  L  L,  where  L  is 

r  x  q  of  full  rank  r,  which  is  usually  less  than  q.  If  so,  then  K  has  a 

non-trivial  null  space:  Let  T  be  q  x  (q  -  r)  such  that  LT  =  0  and 
[Lt  :  T]  is  non-singular.  Our  conditions  are  that  for  all  &»£»  the 
matrix  A  is  non-singular,  and  D,E  and  [D  :  ET]  have  full  rank  p,q 
and  p  +  q  -  r  respectively.  We  may  then  proceed  with  any  positive  finite 
A:  the  matrix  G  is  non-singular.  Convergence  of  the  iteration  (2.4)  is 


not  guaranteed,  but  in  practice  will  usually  occur  rather  rapidly  for  sensible 
initial  values.  The  algorithm  has  at  least  a  fixed-point  justification:  if 
(2.4)  gives  8*  *  0  and  Z*  =  Zt  then  (2.3)  is  satisfied. 

Jointly  with  Dr.  Brian  Yandell,  the  author  is  developing  various 
implementations  of  the  basic  algorithm  (2.4,  2.5).  Details  will  appear 
elsewhere. 


3.  SPECIAL  CASES 

The  general  model  (1.1)  makes  no  assumptions  about  the  independence  of 
random  terms,  or  additivity  and  linearity  among  systematic  components.  Of 
course  such  simplifications  are  sometimes  available.  If  the  log-likelihood 
L  is  that  of  n  independent  observations  {y^}  each  indexed  by  the 
corresponding  iji^,  then  A  is  diagonal.  If  0  is  linear  in  p  or  £, 
then  D  or  E  will  be  constant.  Such  properties  may  be  exploited  in 
algorithms,  but  do  not  affect  a  general  treatment. 

/ 

(a)  The  Linear  Normal  Case 

If  the  observations  y  are  independently  Normally  distributed, 

2 

y  ~  N(6,0  I),  with  a  linear  parameterization  0  =  D0  +  E£,  then  D  and 

-2  2 

E  are  constant,  and  A  =  0  I.  The  scale  factor  0  factorizes  from  both 

sides  of  (2.4),  so  may  be  ignored  :  this  is  an  example  of  a  more  general 

phenomenon  :  see  Section  9.  The  artificial  response  Y  in  (2.5)  is 

identically  y,  and  no  iteration  is  necessary.  If  E,£,X  and  K  are 

omitted,  we  have  the  ordinary  linear  model.  If  D  and  0  ar.1  omitted 

instead  then  we  have  a  model  including  ridge  regression  (when  K  =  I,  we 
A  T  “IT 

obtain  Z  -  (E  E  +  Xl)  E  y)  and  spline  \oothing  (as  described  in 
Section  2).  With  both  D  and  E  =  I  present,  this  covers  the  least-squares 


V 


smoothing  approach  to  the  analysis  of  agricultural  field  trials  due  to  Green, 
Jennison  and  Seheult  (1983,  1985).  They  used  a  roughness  penalty  based  on 
differencing  5  from  neighbouring  plots,  for  example 


5TK5  =  l  (5±  -  2? 


i+1  +  ?i+2* 


(3.1) 


In  this  application,  D  represents  a  designed  experiment,  and  the  resulting 
methodology  may  be  related  to  other  more  classical  approaches  (see  Green, 
1985). 

In  all  these  special  cases,  it  may  be  more  natural  to  focus  on  least- 
squares  rather  than  Normal  theory/maximum  likelihood  as  the  basic  principle. 

(b)  Logistic  Regression. 

To  continue  the  example  from  Section  1,  with  now 
ei  =  { 1  +  exp(“  1  xij$j  -  £±)}  1 ;  we  have  u±  =  (yi  -  nuG^/ie^l  -  6^}, 

A  is  diagonal  with  =  nu/tG^I  -  0^},  E  is  diagonal  with 

E^  =  0.^(1  -  8^ ) ,  and  Dij  =  0,^(1  -  ®^)xij*  equations  (2.4)  are  no 

longer  fixed  and  iteration  is  necessary.  An  appropriate  form  for  K  will 
depend  on  the  temporal  or  spatial  configuration  of  the  {t^}:  see  Section  4. 

(c)  A  Grouped  Continuous  Model. 

For  non-par ametric  regression  of  ordered  categorical  data  on  a  single 
explanatory  variable,  the  following  model  may  be  appropriate.  For 
r  =  1,2,...,R  we  have  a  S-vector  multinomial  response  {yrg,  s  * 
with  associated  probabilities  {prg}  assumed  to  satisfy 

s 

y  p  .  =  ^(8  -  5  ) 

ri  s  r 

for  some  prescribed  distribution  function  f,  where  5^  =  y(tr)  and  tr  is 
the  value  of  the  explanatory  variable  for  this  response.  This  grouped 
continuous  model  is  equivalent  to  the  assumption  of  a  latent  continuous 
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variable  with  distribution  function  'H*  -  £r)  which  is  categorized  into  S 

classes  at  the  unknown  outpoints  {$..»£,  »**.»0  to  yield  the  observed 

frequencies  {y  }.  See  McCullagh  (1980)  for  a  complete  discussion.  This 
JT  S 

falls  into  our  present  framework  if  we  take  0  as 

{6rs  =  0^  —  =  1,2, ... ,R;  s  —  1 , 2, . . • ,S  —  l},  so  that 

prs  =  ^(0rs)  -  ¥(0  1).  The  matrix  A  is  no  longer  diagonal,  but  D  and 

E  have  a  very  simple  form.  For  identif iability ,  one  component  of  5  must  be 

held  fixed  and  omitted  from  equations  (2.4). 

(d)  Multiple  Inhomogeneous  Poisson  Processes 

Suppose  m  point  processes  are  observed:  the  ith  process  is  observed 

for  the  time  interval  (s^,t^]»  and  yields  events  {y^/  k  =  1,2,...,n.}. 

The  observation  intervals  may  depend  on  the  realization:  for  example,  each 

process  may  be  observed  until  the  first  event,  as  in  survival  analysis.  If 

the  processes  may  be  modelled  as  independent  inhomogeneous  Poisson  processes 
P 


rates  <p(  l  x..0.,Y(t))  at  time 
j=1  -1  J 


t  for  the  i  process,  involving  an 


unknown  linear  combination  of  covariates  {x^j},  the  appropriate  log- 
likelihood  is 


m  t. 


I  r  log  xnei'Y(yik))  “  E  /  1  +(X  xiiPi*Y(y))dy  . 
i=1  k=1  13  3  i=1  Sj  13  3 


Here,  ij»  is  prescribed,  but  0  and  the  function  Y  are  to  be  estimated. 
Again,  a  Dirac  catastrophe  prevents  maximum  likelihood  estimation  of  0 
without  restrictions  on  Y*  Under  the  assumption  that  the  rates  are  simply 
expQ  x£j&j  +  Y(t))  with  appropriate  stopping  rules,  this  is  the 
proportional  hazards  model  (Cox,  1972,  Anderson  and  Senthilselvan,  1980),  and 
estimation  of  0  is  possible  via  partial  likelihood.  Otherwise,  the  approach 
described  in  Section  2  may  be  necessary. 
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CHOICE  OF  ROUGHNESS  PENALTY 


Various  authors  have  remarked  in  different  contexts  that  choxce  of  the 
amount  of  smoothing  (A  in  equation  (2.2))  is  more  important  than  the 
smoothing  kernel  K  itself.  This  might  be  amended  by  adding  that  the  null 
space  of  the  roughness,  the  space  {(  !  5  1?  =  0}  may  also  be  important;  for 
vectors  in  this  null  space,  since  they  are  not  penalized  in  (2.2),  are 
implicitly  also  fitted  as  covariates. 

Spline  smoothers  would  choose  a  penalty  of  the  form 


J(Y)  -  /  (Y(m>(t))2dt 


(4.1) 


for  a  curve  on  a  single-dimensional  variable.  As  mentioned  in  Section  2  this 
is  equivalent  to  our  approach;  the  kernel  K  is  given  by 


Kij  -  /  ^m>(t)*‘m)(t)dt  i,j  =  1,2, ...,q  . 


the  rank  of  K  will  be  q  -  m  for  any  spline  basis  (4>k  :  k  =  1,...,q),  the 

q 

null  space  of  K  consisting  of  those  £  for  which  £  £  $.  is  a  polynomial 

k=1 

of  degree  (m  -  1)  or  less. 

Wahba  (1978)  derives  spline  smoothing  from  a  Bayesian  model  in  which  an 
appropriate  prior,  which  is  partially  improper,  is  constructed  on  a  space  of 
smooth  functions;  see  also  Silverman  (1985).  In  the  notation  above,  the  prior 
is  a  multi v»riate  Normal  distribution  for  £  with  mean  0  and  inverse 
variance  matrix  AK.  Impropriety  of  the  prior  is  equivalent  to  deficient  rank 
in  K.  In  our  present  partially  parametric  context,  with  an  uninformative 
prior  for  0  as  an  additional  ingredient,  the  maximum  penalized  likelihood 


estimate  for  (0,£)  is  the  mode  of  the  corresponding  posterior 


3 


distribution.  He  can  make  the  prior  more  explicit,  whilst  avoiding 

impropriety,  as  follows.  Let  L  and  T  be  as  constructed  in  Section  2;  then 

we  can  generate  the  prior  for  £  as 

5  =  T6  +  LT(LLT)_1e  (4.2) 

where  6  is  a  fixed  (q  -  r) -vector  and  e  an  r-vector  of  zero-mean, 

uncorrelated  Normally  distributed  random  variables  with  variance  X  1 •  We  can 

T 

see  that  the  penalty  term  X£  is  indeed  twice  the  negative  of  the 
appropriate  log-likelihood  term. 

Leonard  (1982)  provides  a  more  completely  Bayesian  approach  for  the  non- 
parametric  case,  again  using  a  Gaussian  process  as  a  prior  for  y. 
specifically  he  recommends  an  Ornstein-Uhlenbeck  process,  with  two 
hyperparameters  in  place  of  X,  for  the  difference  between  the  derivative  of 
y  and  a  prescribed  or  estimated  base  curve.  The  full  empirical  Bayesian 
approach  allows  estimation  of  the  hyperparameters. 

If  the  observations  are  located  at  equally-spaced  {t.^}  on  a  line,  the 
squared  mt'1  derivative  penalty  (4.1)  will  in  practice  be  indistinguishable 
from  that  involving  mth  differences,  for  example  (3.1),  with  a  basis  such  that 
$j(tf)  =  ^ij*  Use  other  roughness  penalties  was  also  considered  by  Green 
et  al.  (1985). 

When,  and  only  when,  the  log-likelihood  L  is  that  of  a  Normal 
distribution  with  expectation  0  linear  in  C,  addition  of  the  roughness 
penalty  corresponds  to  use  of  a  'random  effects'  model  for  5,  or 
equivalently,  so  far  as  0  is  concerned,  to  modification  of  the  assumed 
variance  structure  for  y.  See  Green  (1985). 

Whatever  form  of  K  is  chosen,  the  tuning  constant  X  controls  the 
relative  impact  of  roughness,  as  judged  by  K,  and  error,  determined  by  the 
likelihood.  The  extreme  cases  X  +  «°  and  0,  between  which  we  wish  to 


compromise,  can  be  interpreted  as  follows.  As  X  ♦  00 ,  the  likelihood  is 

maximized  subject  to  a  roughness  penalty  of  0,  that  is  we  restrict  to  the 

purely  parametric  model  L(6(3,T6)).  As  X  0,  the  problem  degenerates  to 

T 

the  minimization  of  5  K£  subject  to  the  'interpolation'  condition  that 
0(B,5)  maximizes  L( 0(3,5)):  whether  this  is  a  constraint  on  £  depends  on 
the  form  of  0. 

5.  ASYMPTOTICS  AND  STANDARD  ERRORS 

For  various  reasons,  a  rigorous  treatment  of  the  asymptotic  theory  of  the 

maximum  penalized  likelihood  estimates  (B»5)  is  extremely  difficult:  the 

arbitrariness  of  the  probability  model,  and  its  parameterization,  the  presence 

of  the  roughness  penalty,  and  flexibility  over  which  parameter  dimensions  the 

asymptotics  are  to  be  with  respect  to.  We  introduce  a  hyperparameter  N, 

upon  which  q  and  n  may  depend,  and  consider  the  limit  N  ♦  «.  The 

dimension  p  of  the  parameter  3  will  remain  fixed,  but  other  quantities 

such  as  X  and  K  may  implicitly  depend  on  N.  It  is  clear  that  we  do  not 

require  n  «•:  for  example  in  parametric  logistic  regression  the  standard 

asymptotic  theory  applies  if  either  n  ♦  »  or  min{m^}  ♦  <*>.  For  the  purely 

parametric  case,  a  rather  general  result  is  given  by  McCullagh  (1983),  which 

we  believe  might  be  extended  to  the  present  case  if  q  remains  fixed  as 

—  1  T 

N  +  <*>.  His  condition  is  essentially  that  N  D  AD  has  a  non-singular  limit. 

In  the  spline  smoothing  case  the  usual  asymptotic  framework  is  one  where 
the  locations  {t^}  of  the  observations  become  increasingly  dense  in  a  finite 
interval  as  N  <*>  (Craven  and  Wahba,  1979»  Cox,  1984).  Hris  is  true  also 
for  the  non-Normal  fully  parametric  case:  see  O'Sullivan  (1983)  and  Cox  and 
O'Sullivan  (1983).  However  for  the  partial-spline/least-squares  smoothing 


case  it  is  intuitively  plausible  that  estimates  of  at  least  certain  contrasts 
of  0  should  be  consistent  and  asymptotically  Normal  even  when  the  spacings 
between  the  {t^}  do  not  decrease . 

Here  we  will  only  attempt  crude  heuristics.  Suppose  that  the  model 
L(0(0,5))  is  correct,  so  that  there  are  true  values  0O,5O*  (This  is 
restrictive:  we  would  hope  to  estimate  0Q  satisfactorily  without  such 
strong  assumptions  on  the  non-parametric  part  of  the  model ) .  We  will  use  the 

A 

symbols  A,  etc.,  on  other  quantities,  so  that  for  example  u  means 

A  A  A  A 

u( 0(0,£)).  For  some  point  (5,?)  between  (0,5)  and  (0O»5Q)»  we  have  by 
Taylor's  theorem: 


3P 

*50 


|f  /  6.5 


3P 

T0 

3P 

■55 


e  ,5n  . 

O  O  L 


32P 


aC|)C|JT 


6,5 


The  left-hand  side  is  0,  so 


( 


-  XK5 

o 


where  here,  and  later,  we  use  ~  to  m®an  asymptotically  equal,  under 
unspecified  assumptions.  We  have  E(uQ)  -  0,  var(uQ)  *  Aq,  and 

a  a  a  a 

G  ~  G©  ~  G,  Aq  ~  A,  Dq  ~  D,  Eq  ~  E •  So  if  the  central  limit  theorem 
applies  in  the  usual  way,  we  would  expect  that  the  asymptotic  distribution  of 

A  A 

(0,5)  is  approximately 


( 


0 

A 

5 


) 


(5.1) 


where  G  and  H  may  be  estimated  by  G  and  H 


Obviously  much  remains  to  be  proved,  but  these  arguments  do  suggest  that 

A 

we  may  at  least  use  (5.1)  to  derive  nominal  standard  errors  for  5,  assuming 
that  we  have  chosen  X  depending  on  N  in  such  a  way  that  the  penalized 
likelihood  estimates  are  consistent. 

6.  DEVIANCE  AND  DEGREES  OF  FREEDOM. 

In  the  spirit  of  the  previous  section,  an  analogue  of  the  usual 

likelihood  ratio  statistic  may  be  constructed,  under  a  regularity  condition. 

He  need  the  notion  of  'saturated  model*  to  be  well-defined,  in  the  following 

way.  Let  -  sup  L(6)  *  L(6'),  say,  be  the  maximum  value  attained  by 

6 

L(0)  when  the  n-vector  8  of  predictors  is  freed  from  its  functional 
dependence  on  0  and  £.  We  suppose  Lmax  <  <*>.  Define  the  deviance 
associated  with  a  particular  regression  function  0(0,£)  and  given  values  8 
and  £  by 

A  *  2(1^  -  L(8(0,5))}  ; 

note  that  because  of  the  penalization,  our  estimates  do  not  minimize  A,  but 

rather  the  penalized  deviance 

A  +  X£TK5  =  2{h  -  P(8,5)>  • 

max 

In  certain  circumstances  the  information  matrix  A  may  be  approximately 

A  A 

constant,  and  the  likelihood  approximately  quadratic,  near  6*  and  6(0, £), 

A  AAA  A 

in  which  case  since  u(8')  ■  0  we  have  u  **  u(0(8,5))  ~  A(8*  -  0),  whence 

A_A  M  A 

A  ~  u1A-1u.  We  will  refer  to  this  latter  expression  as  the  linearized 
deviance . 

Just  as  is  standard  practice  with  generalized  linear  models  (Nelder  and 
Wedderburn,  1972),  model  selection  in  the  form  of  choice  of  a  regression 
function  0(0, £)  may  be  based  on  nominal  significance  tests  using  the 
approximate  asymptotic  distribution  of  the  deviance.  Similar  heuristics  to 


2 

those  used  above  justify  a  X  approximation  on  an  appropriate  "equivalent" 
degrees  of  freedom,  which  will  not  in  general  be  integral,  defined  via  the 
asymptotic  expectation  of  the  deviance. 

We  have 

E(A)  ~  E(u  A  u)  . 

A  A  A  A 

But  A  ~  A  and  u~U--A(D(B“B)+E(5-C))*  so  from  the 
=  O  “UOO  o  o  o 

A  A 

asymptotic  distributions  of  (B#S)  in  Section  5,  we  find 

E(uTA_1u)  ~  (0  XKe0)TG"1(DT)AA"1A(D  E)G_1(A°5  ) 

E  0 

+  trfA”1 [(i  -  A(D  :  E)g'1(Dt))a(i  -  (D  :  E)G'n(DT)A)]} 

E  ’  E 

-  tr(M2)  +  Qn  -  Q2  say 
T 

where  M  *  (i  -  B  (D  E)G  (  t)b),  B  is  a  square  root  of  A  «=  BBT,  and 

E 

Qj  -  £(XK<ETAE  +  XK  -  Note  that  0  <  <  Qj  +  1' 

with  mutual  equality  if  and  only  if  K£0  =  0.  Turning  now  to  the  penalty  term 
evaluated  at  the  MPL  estimates, 

e(x£tk5)  =*  [5*  +  (o  :  (-xk5o)t)g"1(J)]xk[5o  +  (o  :  DG"^^  )] 

O 

+  tr (XK( 0  :  I)G-1(Dt)a(D  :  E)G_1(°)) 

E 

-  tr (M  -  M2)  +  Q0  -  2Q1  +  Q2  . 

Thu 8  the  penalized  deviance  has  asymptotic  expectation 

E(A  +  X5Tk£)  ~  tr (M)  +  (Q0  -  Qt)  . 

The  utility  of  these  expressions  is  limited  by  presence  of  the  correction 
terms  involving  quadratic  forms  in  the  unknown  tiue  Three  possible 


'A'  .V  r  ..W  j  j 


approaches  include  (i)  neglecting  these  terms,  leading  to  an  underestimate  of 
degrees  of  freedom  and  hence  conservative  tests,  (ii)  estimating  these  terms 
from  the  data,  a  necessarily  hazardous  activity,  or  (iii)  replacing  them  by 
their  expectations  under  the  prior  distribution  (4.2)  that  was  used  to  justify 
the  penalty  function. 

T  T 

For  the  latter  approach,  we  have  E (X£QK£o)  =  E(Ae  e)  =  r,  and 
for  any  q  x  q  matrix  N,  E ( ( Ak£q )TN( AK5q ) )  =  E( ALTe)TN( ALTe) ) 

=  tr(AKN).  Thus  E(Qj)  =  tj,  say,  where  tg  =  r,  and  tj  = 
tr{AK(ETAE  +  AK  -  ETAD(DTAD)_1DTAE)}j  for  j  =  1,2,3,...  .  But  a  little 
manipulation  reveals  that  tr(M^)  *  n  -  (p  +  q)  +  tj  for  j  =  1,2,3,...,  so 
that  under  the  prior  (4.2)  the  terms  in  the  penalty  function  have  asymptotic 
expectations,  at  (B,€),  with  the  simple  forms: 

E(uTA-1u)  ~  n  -  (p  +  q)  +  t.  *  tr(M)  *  v  , 

say 

E(a£Tk£)  ~  tQ  -  t1  -  n  -  (p  +  q)  +  r  -  v 

It  may  be  shown  that  for  any  A,  v  satisfies  the  inequality 
n  -  rk[D  !  E]  <v<n-(p+q)+r. 

(Its  exact  form  as  a  function  of  A  when  D,E,A  and  K  are  constant  is 
given  in  Green  (1985).  In  the  notation  of  that  paper,  R  is  ET  and  V  is 
I  +  A"1E(ETE)“1LT(L(ETE)"1LTr2I,(ETE)"1ET).  Combining  this  with  the 
information  above  about  asymptotic  expectations  supports  the  use  of  v  as  a 
surrogate  error  degrees-of-freedom.  Parameters  corresponding  to  the  columns 
of  [D  !  ET]  (i.e.  0  and  6)  are  always  fitted,  requiring  (p  +  q  -  r) 

degrees  of  freedom,  and  the  remaining  n  -  (p  +q)  +r  -v  are  associated 
with  the  non-parametric  component  of  5  permitted  by  A  <  «°.  Of  course,  no 
formal  distribution  theory  involving  v  is  known:  it  should  only  be  used 
informally  to  gauge  model  adequacy  as  measured  by  the  deviance. 
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7.  CROSS-VALIDATION 

Model  selection  by  means  of  cross-validation  was  discussed  in  a 
systematic  way  by  Stone  (1974).  Ita  use  in  determining  an  appropriate  degree 
of  smoothing  in  non-parametric  regression  problems  has  been  enthusiastically 
espoused  by  Wahba  and  co-workers  in  the  past  ten  years,  and,  in  a  refined  form 
known  as  generalized  cross-validation  (GCV)  (Wahba,  1977)  seems  to  have  become 
the  de  facto  standard  approach.  In  the  linear  problem,  GCV  has  additional 
invariance  over  the  ordinary  version,  it  has  now  become  computationally 
practicable,  and  it  is  known  to  provide  an  asympotically  optimal  degree  of 
smoothing  in  a  predictive  mean-square  sense.  O'Sullivan  (1983)  generalizes 
the  application  of  GCV  to  generalized  linear  models  by  transcribing  a  formula 
from  the  linear  case,  without  deriving  the  criterion  afresh  from  its  plausible 
first  principles.  Thus  we  attempt  to  do  here,  for  our  more  general  class  of 
regression  problems. 

The  basic  idea  in  cross-validation  is  to  delete  one  observation  at  a  time 
from  the  data  set,  and  endeavour  to  predict  it  from  the  model  as  fitted  to  the 
remaining  observations.  The  smoothing  parameter  is  chosen  to  optimize  the 
overall  quality  of  prediction.  The  appropriate  generalization  of  this 
'delete-one'  operation  in  our  model  L( 6 ($,£))  consists  of  decoupling  each 
component  of  8  in  turn  from  its  dependence  on  8  and  The  predictive 

discrepancy  will  be  measured  in  likelihood  or  deviance  terms. 

The  decoupling  is  achieved  by  the  introduction  of  dummy  covariates.  For 
some  genrality,  let  F  be  an  arbitrary  n  x  f  matrix,  f  >  1,  and  let 
0,  £  and  t  maximize  the  penalized  decoupled  likelihood 

L(0(#,£)  +  Ft)  -  -i  We  define  the  predictive  discrepancy  in  the  column 

space  of  F  as  the  non-negative  quantity 

A+(F)  =  2{L(0(?,?)  +  Ft)  -  L(0(2,£))}  »  (7.1) 
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if  this  is  zero  then  the  decoupled  estimates  ($,?)  coincide  with  (3,5)*  We 
will  average  A*(F)  over  an  appropriate  set  of  directions  to  give  an  overall 
predictive  discrepancy,  but  first  we  obtain  a  linear  approximation  for 
A+(F). 


At  (3,5,T)  we  have 


where 


•JVw 

DU  =  0 
E  u  «  Ak5 

T~ 

F  u  =  0 


u  -  u(0(3,5)  +  ft) 

-  u( 0{ 3,5) )  -  A(D( 0  -  3)  +  E(5  -  5)  +  Ft) 


ijia  ipA  A 

But  we  know  that  D  u  ■  0  and  E  u  =  XK5,  so  by  subtraction,  treating  A,D 


fixed 

(evaluated  at 

(3,5), 

say),  we 

T 

DXAD 

T 

DAE 

dtaf^ 

T 

E  AD 

etae  +  Ak 

etaf 

«-> 

T 

FAD 

T 

FAE 

ftaf  j 

l\  T 

whence  3  and  X  may  be  eliminated  to  give 


/T  _  _T.,_  •  -1  I'D  \  — 1  T*  .  _T  T_. — 1 

~  (.F  AF  -  F  A(D  .  E)G  JAFJ  F  U  =  (F  BMB  F)  F  u  . 


So,  by  linearizing  the  expression  (8.1)  and  noting  that  (?,5,Y)  maximizes 

the  first  term,  penalized,  we  have 

*  ~  T  ~  *T  T  T  -1  T  T  T  -1  TA 

A  (F)  ~  (FT)  A(FT)  *  u  F(F  BMB  F)  F  AF(F  BMB  F)  F  U  .  (7.2) 

This  expression  measures  the  result  of  decoupling  any  number  f  of  components 


of  0*  for  an  analogue  of  delete-one  cross-validation  we  set  f  **  1.  For 


example  if  F  =  e*A*,  the  unit  vector  in  the  ifc^  coordinate  direction, 

t  »,  \A2 

A  <•  >  : — ■  i  ;  • 

If  A  is  not  diagonal,  we  may  prefer  to  rotate  lP  and  have 

A+((B_1)Te(i) )  ~  <b“1G)J/mJ1  . 

In  generalized  cross-validation  the  individual  predictive  discrepancies  are 
combined  over  different  directions  by  a  weighted  sun  enjoying  certain 
invariance  properties.  Let  w^  =  M^/trjM),  so  that  'i  w^  =  1,  and  define 
the  GCV  criterion 


V( X)  -  l  w2A+((B_1)Te(i))  ~  (u  Vu)/<tr(M))4  ~  A/v4  . 
i-1 


,2  -  .2 


We  can  choose  X  to  minimize  this  quantity,  which  has  the  same  form  as  that 
used  by  O'Sullivan  (1983). 

That  this  is  the  correct  weighting  of  the  individual  predictive 

discrepancies  can  be  seen  by  examining  the  invariance  properties  of  V(X). 

Re-parameterization  by  invertible  appropriately  differentiable  transformations 

of  0,3  and  £  does  not  change  the  model;  it  alters  u,A,D  and  E,  but 

u  A  u.  A,  m  and  v  remain  invariant,  and  so  therefore  does  V(X). 

One  justification  for  use  of  the  GCV  criterion  in  the  linear  (spline) 

case  is  provided  by  the  result  of  Craven  and  Wahba  (1979)  stating  that  such  a 

criterion  is  asymptotically  optimal  in  the  sense  of  minimizing  the  mean- 

1  ^  A  2 

squared  error  R(X)  -  —  £  (Y(t.)  -  Y  (t,))  ♦  (Of  course,  this  property  may 

n  i=1  1  x 

be  shared  by  many  other  criteria  for  choosing  X).  The  only  natural 
expression  of  R(X)  in  likelihood  terms  seems  to  be  via  the  divergence 
defined  by  Kullback  and  Leibler  (1951).  We  define  R(X)  so  that 


.. 

•  ^  \  / 
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nR(A)  -  2E.  (L( 0  )  -  L(  0)  )  ~  (0  -  0  )TA(0  -  0  )  ~  (u  -  u  )TA-1(u  -  u  ) 
o  o  =  o  o=*o  o 

o 

A  A 

to  give  an  appropriate  weighted  m.s.e.  for  (6*5)*  which  makes  connections 
with  the  linearized  deviance  apparent*  Cross-validation  and  Kullback-Leibler 
distance  are  also  discussed  by  Bowman,  Hall  and  Titterington  (1984). 
O'Sullivan  (1983)  sketches  an  argument  suggesting  that  the  Craven  and  Wahba 
result  extends  to  the  generalized  linear  model  case,  with  a  definition  of 
R(A)  equivalent  to  the  above;  it  therefore  seems  likely  that  if  his  proof 
could  be  rigorized,  it  might  apply  to  the  present  more  general  setup  as  well. 
Note  that  by  arguments  similar  to  those  of  Section  6,  we  have 
E( R( A) )  ~  Q1  -  Q2  +  tr( (I  -  M)2)  , 
whose  expectation  under  the  prior  for  $q  is  just  tr(I  -  M) . 


8.  RESIDUALS. 

How  best  to  define  residuals  depends  very  much  on  the  purpose  co  which 
they  are  to  be  put.  The  multitude  of  definitions  available  even  in  simple 
linear  regression  models  (see  Cook  and  Weisberg,  1982)  strongly  suggests  that 
even  more  alternatives  will  be  available  in  our  present  general  context.  Here 
we  attempt  only  a  limited  discussion. 

We  seek  residuals  primarily  for  diagnostic  purposes,  and,  in  view  of  our 
reliance  on  the  likelihood  function  L(  6(6,5)),  prefer  these  to  be 
likelihood-based  and  associated  with  the  predictors  0  rather  than  directly 
with  the  observations.  Use  of  such  residuals  for  diagnosis  of  data-inadequacy 
will  require  inspection  of  the  likelihood  function  to  determine  the  data- 
points  instrumental  in  giving  a  particular  component  of  0  a  large 
residual.  Detection  of  model  inadequacy  can  proceed  more  directly,  and  note 
that  we  not  desire  invariance  of  residuals  to  transformation  of  0  itself. 
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The  likelihood  emphasis  suggests  concentrating  on  the  deviance  A, 
defined  in  Section  6.  Restricting  attention  to  one  or  more  particular 
components  of  9,  define 

A(F)  -  2 {sup  L(0(0,£)  +  Ft)  -  L(0<0,£))}  ,  (8.1) 

T 

twice  the  maximum  increase  in  log-likelihood  attained  by  freeing  0  from  its 
dependence  on  0  and  5,  in  the  directions  spanned  by  columns  of  F.  If 
F  is  non-singular,  A(F)  =  A.  Note  that  A(F)  <  A*(F);  the  sole  difference 
between  the  two  quantities  lies  in  the  inclusion  or  exclusion  of  the 
corresponding  components  of  0  from  the  fitting  of  the  model.  Also  note  that 
the  maximum  penalized-likelihood  ratio  statistic  2 {sup  P( 0(0,5)  +  Ft) 

-  sup  P(0(0,£))}  lies  between  A(F)  and  A+(F). 

Choosing  a  single  coordinate  direction  e^^  for  F,  we  obtain  the  raw 
deviance  and  discrepancy  A(e^  )  and  At(e^i^)  respectively,  which  we 
abbreviate  as  A^  and  A*.  The  raw  deviances  A^^  have  been  customarily  used 
to  define  residuals  in  generalized  linear  models  (see  discussion  in  Green, 
1984).  Finally,  we  denote  the  signed  square-roots  by  z^  =  sign(Tmax)/A^  and 

z?  *  sign(T  )/  A?,  where  t  denotes  the  value  of  t  in  the 
i  3  max  i  max 

maximization  of  (8.1)  and  in  (7.1)  respectively. 

These  concepts  tie  in  well  with  other  treatments  of  residuals.  In  the 
case  of  Normal  linear  regression  (with  known  variances  =  1,  say,  for 
simplicity),  z^  and  z*  are  just  the  ordinary  and  predicted  residuals, 
respectively,  of  Cook  and  Weisberg  (1982,  Chapter  2).  These  are  known  to  be 
correlated,  and  improperly  standardized  for  variance.  Cook  and  Weisberg  point 
out  that  z^  and  z*  respectively  under-  and  over-emphasize  discrepancies 
for  high-leverage  data  points.  This  difficulty  will  persist  in  the  more 


general  cases 


When  y  is  distributed  Normally  with  expectation  9  =  Dfi  and  known  non¬ 
diagonal  variance  matrix  V  we  find 

=  {V-1(y  -  0)}i{<v"1)ii}“1/2 


and 

z*  «  (V-1 (y  -  0)}i{Cv“1)ii}1/2{(V-1  -  v"1D(DTD)"1DTv“1)ii}"1 

which  are  in  fact  yA  standardized  by  its  expectation  and  variance  assuming 
the  parameters  to  be  equal  to  their  estimates  with  and  without  yi# 
respectively,  conditional  on  the  other  components  of  y. 

In  contrast  to  Jorgensen  (1984),  we  believe  that  the  use  of  unconditional 
moments  in  this  standardization  is  inappropriate  for  dependent  observations. 
For  these  examples,  the  linearization  leading  to  (8.2),  and  by  a  simpler 

aT  «p  .  ^ 

argument  to  MF)  ~  u  F(F  AF)  F  u  involves  no  approximation.  In  general, 
when  the  likelihood  is  not  quadratic,  replacing  A^  and  A*  by  these 
approximations  leads  to  different  residuals  z^  and  z*.  The  former  have 
been  called  'score  residuals'  (Jorgensen,  1983).  But  as  pointed  out  by  Green 
( 1984) ,  these  score  residuals  are  not  appropriate  when  the  quadratic 
approximation  is  badly  wrong:  for  example ,  they  are  not  monotonic  functions 
of  the  observations  in  linear  regression  with  a  prescribed  error  density  that 
is  not  log-concave. 

9.  NUISANCE  PARAMETERS. 

The  approach  to  penalized  likelihood  estimation  described  here  can  handle 
with  no  difficulty  certain  types  of  nuisance  parameter  entering  the 
probability  model  in  addition  to  the  predictors  9.  Suppose,  following 
Jorgensen  (1963)  that 

L  -  L(y»9,<)  -  c(y;K)  +  o(tc)t(y»0) 
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where  0(  ),  which  we  might  term  the  precision  parameter,  is  a  scalar 

function  of  the  possibly  vector-valued  nuisance  parameter  k.  See  also  Green 

(1984).  This  is  in  a  sense  the  ultimate  generalization  of  the  property  of 

generalized  linear  models  in  which  the  scale  parameter  factors  out  from  the 

fitting  procedure  and  is  estimated  at  convergence  from  the  deviance.  Examples 

include  the  variance  in  the  Normal  distribution,  the  index  in  the  gamma 

distribution,  and  also  the  extra  parameter  often  allowed  as  a  modification  of 

the  binomial  or  Poisson  distributions  to  allow  for  'over-dispersion' . 

The  maximum  penalized  likelihood  estimates  of  0  and  £  now  satisfy 

Ta 

CD  u  =  0 

mA  * 

OE  u  =  XK£  . 

Fisher  scoring  is  no  longer  available  necessarily,  because  the  expectation  of 

a2t 

t(y;0)  will  in  general  involve  k,  but  if  we  write  A  =  - —  then 

300T 

neglecting  the  second  derivatives  of  0  with  respect  to  0  and  £,  (the 
"linearization  method"  of  Jorgensen  (1983))  we  obtain  the  approximate  Newton- 
Raphson  iteration 


demonstrating  that  0  and  £  can  be  estimated  without  paying  any  attention 
to  the  nuisance  parameter  tc,  except  that  the  value  of  the  tuning  constant 
X  is  now  effectively  measured  with  respect  to  the  unknown  precision  o,  a 
consequence  that  is  not  likely  to  be  of  any  serious  concern. 
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